London Bikes data

We’ll use data from http://www.tfl.gov.uk to analyse usage of the London Bike Sharing scheme. This data has already been downloaded for you and exists in a CSV (Comma Separated Values) file that you have to read in to R.

There is no dropdown menu to read in your data to R, so instead we use functions like read_csv to load data from file into R objects.

#read the CSV file
bike <- read_csv(here::here("data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE))

Cleaning our data

Sometimes our data needs a bit of ‘cleaning’. For instance, day_of_week is variable type character, or chr. We should, however, treat it as a categorical, or factor variable and relevel it, so Monday is the first level of the factor (or first day of week), etc.

R is fairly sensitive with dates. When you read a CSV file, the date may be in different formats. For instance, Christmas 2017 could be input as 12-25-2017, 25.12.2017, 25 Dec 2017, Dec 25, 2017, etc. To be consistent, we use lubridate’s ymd function, and force variable Day to be a date in the format YYYY-MM-DD

Finally, we can turn season from 1, 2, 3, 4, to words like Winter, Spring, etc.

We will be talking more about data wrangling later, but for now just execute the following piece of code.

# fix dates using lubridate, and generate new variables for year, month, month_name, day, and day_of _week
bike <- bike %>%   
  mutate(
    year=year(date),
    month = month(date),
    month_name=month(date, label = TRUE),
    day_of_week = wday(date, label = TRUE)) 

# generate new variable season_name to turn seasons from numbers to Winter, Spring, etc
bike <- bike %>%  
  mutate(
    season_name = case_when(
      month_name %in%  c("Dec", "Jan", "Feb")  ~ "Winter",
      month_name %in%  c("Mar", "Apr", "May")  ~ "Spring",
      month_name %in%  c("Jun", "Jul", "Aug")  ~ "Summer",
      month_name %in%  c("Sep", "Oct", "Nov")  ~ "Autumn",
    ),
    season_name = factor(season_name, 
                         levels = c("Winter", "Spring", "Summer", "Autumn")))
    
#examine the structure of the datafame
skim(bike)
Data summary
Name bike
Number of rows 4934
Number of columns 20
_______________________
Column type frequency:
character 1
factor 3
logical 1
numeric 14
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
wday 0 1 3 3 0 7 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month_name 0 1 TRUE 12 Jan: 434, Aug: 434, Oct: 434, Dec: 434
day_of_week 0 1 TRUE 7 Sun: 705, Mon: 705, Tue: 705, Wed: 705
season_name 0 1 FALSE 4 Aut: 1274, Win: 1235, Sum: 1229, Spr: 1196

Variable type: logical

skim_variable n_missing complete_rate mean count
weekend 0 1 0.29 FAL: 3524, TRU: 1410

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bikes_hired 0 1.00 26448.57 9677.64 0.0 19604.0 26129.00 33233.75 73094.0 ▂▇▆▁▁
year 0 1.00 2016.82 3.91 2010.0 2013.0 2017.00 2020.00 2024.0 ▆▇▇▇▆
month 0 1.00 6.60 3.47 1.0 4.0 7.00 10.00 12.0 ▇▅▅▅▇
week 0 1.00 26.90 15.16 1.0 14.0 27.00 40.00 53.0 ▇▇▇▇▇
cloud_cover 33 0.99 4.92 2.37 0.0 3.0 5.00 7.00 9.0 ▃▃▆▇▃
humidity 83 0.98 75.71 11.23 33.0 68.0 77.00 84.00 100.0 ▁▂▆▇▃
pressure 31 0.99 10153.10 105.27 9668.0 10091.0 10162.00 10224.00 10477.0 ▁▁▆▇▁
radiation 40 0.99 118.18 89.45 2.0 40.0 93.50 186.00 402.0 ▇▅▃▂▁
precipitation 31 0.99 17.35 38.24 0.0 0.0 0.00 18.00 516.0 ▇▁▁▁▁
snow_depth 302 0.94 0.02 0.23 0.0 0.0 0.00 0.00 7.0 ▇▁▁▁▁
sunshine 31 0.99 41.23 38.97 0.0 5.0 32.00 67.00 147.0 ▇▃▂▂▁
mean_temp 31 0.99 12.05 5.72 -4.1 7.7 11.90 16.60 30.9 ▁▇▇▅▁
min_temp 62 0.99 8.06 5.27 -9.4 4.2 8.25 12.20 22.3 ▁▃▇▇▁
max_temp 31 0.99 16.06 6.61 -1.2 11.0 15.70 21.00 40.2 ▂▇▇▂▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2010-07-30 2024-01-31 2017-04-30 12:00:00 4934

Summary Statistics

Besides the number of bikes hired each day, we also have data on the weather for that day

Having loaded and cleaned our data, we can create summary statistics using mosaic’s favstats. This is uni-variate analysis, so in our mosaic syntax we will use favstats(~ bikes_hired, data= bike). We also want to get an overall, time-series plot of bikes over time; for the latter, we just create a scatter plot of bikes_hired vs Day.

favstats(~ bikes_hired, data= bike)
minQ1medianQ3maxmeansdnmissing
01.96e+042.61e+043.32e+047.31e+042.64e+049.68e+0349340

While this analysis shows us overall statistics, what if we wanted to get summary statistics by year, day_of_week, month, or season? Mosaic’s syntax Y ~ X alows us to facet our analysis of a variable Y by another variable X using the syntax favstats( Y ~ Z, data=...)

favstats(bikes_hired ~ year, data=bike)
yearminQ1medianQ3maxmeansdnmissing
20102.76e+039.3e+03 1.4e+04 1.87e+042.8e+04 1.41e+045.62e+031550
20114.56e+031.63e+042.03e+042.37e+042.94e+041.96e+045.5e+03 3650
20123.53e+031.93e+042.62e+043.25e+044.71e+042.6e+04 9.43e+033660
20133.73e+031.76e+042.2e+04 2.74e+043.56e+042.2e+04 7.28e+033650
20144.33e+032.05e+042.77e+043.44e+044.9e+04 2.75e+049.07e+033650
20155.78e+032.21e+042.66e+043.29e+047.31e+042.7e+04 8.55e+033650
20164.89e+032.24e+042.79e+043.51e+044.66e+042.82e+048.85e+033660
20175.14e+032.41e+042.95e+043.45e+044.6e+04 2.86e+048.38e+033650
20185.86e+032.18e+042.92e+043.77e+044.61e+042.9e+04 1.02e+043650
20195.65e+032.42e+042.89e+043.45e+044.47e+042.86e+048.09e+033650
20204.87e+032.01e+042.75e+043.65e+047.02e+042.85e+041.16e+043660
20216.25e+032.17e+043.1e+04 3.82e+045.69e+043e+04       1.1e+04 3650
20220       2.51e+043.14e+044e+04       6.7e+04 3.15e+041.03e+043650
20236.72e+031.99e+042.37e+042.78e+043.53e+042.34e+046.03e+033650
20248.21e+031.46e+041.88e+042.15e+042.47e+041.8e+04 4.64e+03310
favstats(bikes_hired ~ day_of_week, data=bike)
day_of_weekminQ1medianQ3maxmeansdnmissing
Sun0       1.4e+04 2.1e+04 3e+04       6.31e+042.24e+041.07e+047050
Mon3.97e+032.05e+042.59e+043.19e+046.7e+04 2.62e+048.45e+037050
Tue3.76e+032.24e+042.79e+043.48e+046.53e+042.83e+048.78e+037050
Wed4.33e+032.25e+042.79e+043.46e+045.44e+042.84e+048.73e+037050
Thu5.65e+032.25e+042.75e+043.48e+047.31e+042.84e+048.95e+037040
Fri5.4e+03 2.14e+042.66e+043.37e+046.7e+04 2.72e+048.7e+03 7050
Sat0       1.58e+042.25e+043.15e+047.02e+042.44e+041.13e+047050
favstats(bikes_hired ~ month_name, data=bike)
month_nameminQ1medianQ3maxmeansdnmissing
Jan3.73e+031.39e+041.9e+04 2.33e+043.8e+04 1.87e+045.99e+034340
Feb3.53e+031.6e+04 2.07e+042.46e+045.25e+042.04e+046.53e+033670
Mar5.06e+031.77e+042.34e+042.76e+045.66e+042.29e+047.74e+034030
Apr4.87e+032.11e+042.62e+043.13e+044.9e+04 2.62e+047.78e+033900
May1.2e+04 2.46e+043.04e+043.65e+047.02e+043.07e+048.56e+034030
Jun6.06e+032.79e+043.39e+043.96e+046.53e+043.37e+048.77e+033900
Jul5.56e+032.99e+043.69e+044.17e+047.31e+043.52e+048.37e+034050
Aug4.3e+03 2.55e+043.33e+043.89e+046.7e+04 3.16e+049.83e+034340
Sep0       2.51e+043.18e+043.65e+045.18e+043.06e+048.26e+034200
Oct7.07e+032.33e+042.8e+04 3.27e+044.79e+042.75e+046.9e+03 4340
Nov6.03e+031.87e+042.37e+042.8e+04 4.47e+042.32e+046.52e+034200
Dec2.76e+031.16e+041.66e+042.29e+043.91e+041.71e+046.94e+034340
favstats(bikes_hired ~ season_name, data=bike)
season_nameminQ1medianQ3maxmeansdnmissing
Winter2.76e+031.37e+041.88e+042.35e+045.25e+041.86e+046.63e+0312350
Spring4.87e+032.08e+042.65e+043.18e+047.02e+042.66e+048.66e+0311960
Summer4.3e+03 2.77e+043.45e+044e+04       7.31e+043.35e+049.15e+0312290
Autumn0       2.18e+042.74e+043.29e+045.18e+042.71e+047.86e+0312740

Exploratory Data Analysis

Time series plot of bikes rented

While summary statistics allow us to quickly disover key metrics that represent the data, they are often not sufficient by themselves. Often, it is useful to represent the data graphically (or visually) to uncover information that is critical for decision making, such as trends, patterns and outliers.

In this section we will create a time-series scatter-plot that shows the number of bikes hired on each day. We will use the ggplot2 library.

Creating a basic plot is a two-step process. The first step is the specify the data frame and the axes by using the syntax: ggplot(data frame, aes(x=variable, y=variable). The second step is to specify the plot that we want. In this case, we want a scatter (point) plot using geom_point() after a + sign.

ggplot(bike, aes(x=date, y=bikes_hired))+
  geom_smooth()+
  geom_point(alpha = 0.3)+
  theme_bw()+
  NULL
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Further graphs

# Histogram of bikes rented
ggplot(bike, aes(x=bikes_hired))+
  geom_histogram()+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram faceted by season_name
ggplot(bike, aes(x=bikes_hired))+
  geom_histogram()+
  facet_wrap(~season_name)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram faceted by month_name
ggplot(bike, aes(x=bikes_hired))+
  geom_histogram()+
  facet_wrap(~month_name)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram faceted by month_name in 4 rows
ggplot(bike, aes(x=bikes_hired))+
  geom_histogram()+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Density plot 
ggplot(bike, aes(x=bikes_hired))+
  geom_density()+
  theme_bw()+
  NULL

# Density plot filled by season_name 
ggplot(bike, aes(x=bikes_hired))+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  theme_bw()+
  NULL

# Density plot filled by season_name, and faceted by season_name
ggplot(bike, aes(x=bikes_hired))+
  geom_histogram(aes(fill=season_name), alpha = 0.5)+
  facet_wrap(~season_name, nrow = 4)+
  theme_minimal()+
  
  #remove legend to the right
  theme(legend.position = "none")+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Density plot filled by season_name, and faceted by month_name
ggplot(bike, aes(x=bikes_hired))+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  theme(legend.position="none")+
  NULL

#Boxplot of bikes_hired  by month
# since 'month' is a number, it treats it as a continuous variable; hence we get just one box
ggplot(bike, aes(x=month, y= bikes_hired))+
  geom_boxplot()+
  theme_bw()+
  NULL
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

#Boxplot  by month_name
ggplot(bike, aes(x=month_name, y= bikes_hired))+
  geom_boxplot()+
  theme_bw()+
  NULL

#Boxplot by month_name
ggplot(bike, aes(x=month_name, y= bikes_hired, fill=season_name))+
  geom_boxplot()+
  theme_bw()+
  NULL

#Violin plot by month_name
ggplot(bike, aes(x=month_name, y= bikes_hired))+
  geom_violin()+
  theme_bw()+
  NULL

# bikes_hired vs. weather features
ggplot(bike, aes(x=mean_temp, y= bikes_hired))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 31 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 31 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(bike, aes(x=mean_temp, y= bikes_hired, colour=season_name))+
  geom_point(alpha = 0.2)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  #  facet_wrap(~season_name, ncol=1)+
  NULL
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 31 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Removed 31 rows containing missing values or values outside the scale range
## (`geom_point()`).

###### bikes on humidity
ggplot(bike, aes(x=humidity, y= bikes_hired))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 83 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 83 rows containing missing values or values outside the scale range
## (`geom_point()`).

Model building

Correlation - Scatterplot matrix

Besides the number of bikes rented out on a given day, we have also downloaded weather data for London such as mean_temp, humidity, pressure, precipitation, etc. that measure weather conditions on a single day. It may be the case that more bikes are rented out when it’s warmer? Or how can we estimate the effect of rain on rentals?

Your task is to build a regression model that helps you explain the number of rentals per day.

Let us select a few of these numerical variables and create a scatterplot-correlation matrix

bike %>% 
  select(cloud_cover, humidity, pressure, radiation, precipitation, sunshine, mean_temp, bikes_hired) %>% 
  GGally::ggpairs()

Weekend or weekdays? Is there a difference?

t.test(bikes_hired ~ weekend, data= bike)
## 
##  Welch Two Sample t-test
## 
## data:  bikes_hired by weekend
## t = 13.074, df = 2150, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  3664.374 4957.662
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            27680.53            23369.52

Model 0: using the mean to predict bikes_hired

We start the naive model where we just use the average to predict how many bikes we are going to rent out on a single day

favstats(~bikes_hired, data = bike)
minQ1medianQ3maxmeansdnmissing
01.96e+042.61e+043.32e+047.31e+042.64e+049.68e+0349340
# can you create a confidence interval for mean bikes_hired? What is the SE?

model0 <- lm(bikes_hired ~ 1, data= bike)
msummary(model0)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26448.6      137.8     192   <2e-16 ***
## 
## Residual standard error: 9678 on 4933 degrees of freedom
# plot actual data vs predicted data
broom::augment(model0) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

What is the regression’s residual standard error? What is the intercept standard error?

Model 1: bikes_hired on mean_temp

# Define the model
model1 <- lm(bikes_hired ~ mean_temp, data = bike)

# look at model estimated
msummary(model1)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13662.54     251.96   54.23   <2e-16 ***
## mean_temp    1061.12      18.89   56.18   <2e-16 ***
## 
## Residual standard error: 7565 on 4901 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.3918, Adjusted R-squared:  0.3916 
## F-statistic:  3157 on 1 and 4901 DF,  p-value: < 2.2e-16
# Diagnostics
autoplot(model1)

# plot actual data vs predicted data
broom::augment(model1) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

  • Is the effect of mean_temp significant? Why?
  • What proportion of the overall variability in bike rentals does temperature explain?

Model 2: bikes_hired on mean_temp plus weekend

# Define the model
model2 <- lm(bikes_hired ~ mean_temp + weekend, data = bike)

# look at model estimated
msummary(model2)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14889.50     252.84   58.89   <2e-16 ***
## mean_temp    1059.68      18.27   57.99   <2e-16 ***
## weekendTRUE -4236.09     231.46  -18.30   <2e-16 ***
## 
## Residual standard error: 7320 on 4900 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.4307, Adjusted R-squared:  0.4304 
## F-statistic:  1853 on 2 and 4900 DF,  p-value: < 2.2e-16
# Diagnostics
autoplot(model2)

# plot actual data vs predicted data
broom::augment(model2) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

  • Fit a regression model called model2 with the following explanatory variables: mean_temp and weekend

    • Is the effect of mean_temp significant? Why?
    • What proportion of the overall variability does this model explain? What is the meaning of the effect (slope) of weekendTRUE? What % of the variability of bikes_hired does your model explain?

Model 3: bikes_hired on mean_temp plus wday

# Define the model
model3 <- lm(bikes_hired ~ mean_temp + wday, data = bike)

# look at model estimated
msummary(model3)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14360.37     351.73  40.828  < 2e-16 ***
## mean_temp    1058.72      18.15  58.327  < 2e-16 ***
## wdayMon      -862.28     388.50  -2.220  0.02650 *  
## wdaySat     -2717.05     388.49  -6.994 3.04e-12 ***
## wdaySun     -4673.96     388.50 -12.031  < 2e-16 ***
## wdayThu      1170.71     388.49   3.013  0.00260 ** 
## wdayTue      1204.17     388.36   3.101  0.00194 ** 
## wdayWed      1189.48     388.35   3.063  0.00220 ** 
## 
## Residual standard error: 7271 on 4895 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.4389, Adjusted R-squared:  0.4381 
## F-statistic: 547.1 on 7 and 4895 DF,  p-value: < 2.2e-16
# Diagnostics
autoplot(model3)

# plot actual data vs predicted data
broom::augment(model3) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

What is the meaning of the effect (slope) of, e.g., wdayMon? What % of the variability of bikes_hired does your model explain?

model4 <- lm(bikes_hired ~ mean_temp + wday + humidity + factor(month), data = bike)

# look at model estimated
msummary(model4)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     40543.12    1037.02  39.096  < 2e-16 ***
## mean_temp         675.83      32.00  21.117  < 2e-16 ***
## wdayMon          -934.03     353.54  -2.642 0.008269 ** 
## wdaySat         -2684.17     352.88  -7.607 3.36e-14 ***
## wdaySun         -4731.88     352.90 -13.409  < 2e-16 ***
## wdayThu          1177.70     353.77   3.329 0.000878 ***
## wdayTue          1316.35     353.29   3.726 0.000197 ***
## wdayWed          1352.95     353.43   3.828 0.000131 ***
## humidity         -297.67      10.86 -27.421  < 2e-16 ***
## factor(month)2   -186.30     480.70  -0.388 0.698354    
## factor(month)3   -169.84     480.47  -0.353 0.723733    
## factor(month)4   -443.27     517.14  -0.857 0.391400    
## factor(month)5   1666.85     552.42   3.017 0.002563 ** 
## factor(month)6   2556.28     608.73   4.199 2.73e-05 ***
## factor(month)7   2308.79     663.95   3.477 0.000511 ***
## factor(month)8   -110.33     626.01  -0.176 0.860106    
## factor(month)9   1849.24     576.47   3.208 0.001346 ** 
## factor(month)10  3033.28     513.25   5.910 3.66e-09 ***
## factor(month)11  2839.25     470.89   6.030 1.77e-09 ***
## factor(month)12 -1847.66     458.27  -4.032 5.62e-05 ***
## 
## Residual standard error: 6578 on 4831 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.5431, Adjusted R-squared:  0.5413 
## F-statistic: 302.2 on 19 and 4831 DF,  p-value: < 2.2e-16
# plot actual data vs predicted data
broom::augment(model4) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

model5 <- lm(bikes_hired ~ mean_temp + wday + min_temp + max_temp + sunshine +
               humidity + factor(month) + radiation + precipitation, data = bike)

# look at model estimated
msummary(model5)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     22369.658   1306.424  17.123  < 2e-16 ***
## mean_temp        1283.377     98.423  13.039  < 2e-16 ***
## wdayMon         -1041.150    333.557  -3.121 0.001811 ** 
## wdaySat         -2779.064    333.042  -8.344  < 2e-16 ***
## wdaySun         -4830.052    333.003 -14.505  < 2e-16 ***
## wdayThu          1026.337    333.830   3.074 0.002121 ** 
## wdayTue          1078.955    333.400   3.236 0.001219 ** 
## wdayWed          1232.349    333.819   3.692 0.000225 ***
## min_temp         -702.018     80.299  -8.743  < 2e-16 ***
## max_temp           61.535     37.720   1.631 0.102879    
## sunshine           14.627      4.815   3.038 0.002396 ** 
## humidity         -104.280     14.181  -7.353 2.26e-13 ***
## factor(month)2   -151.918    454.385  -0.334 0.738139    
## factor(month)3   -689.338    470.786  -1.464 0.143197    
## factor(month)4  -1574.691    541.026  -2.911 0.003624 ** 
## factor(month)5    318.321    606.483   0.525 0.599702    
## factor(month)6   1215.583    661.607   1.837 0.066224 .  
## factor(month)7   1327.477    702.313   1.890 0.058798 .  
## factor(month)8   -981.378    649.183  -1.512 0.130673    
## factor(month)9    757.755    570.252   1.329 0.183975    
## factor(month)10  2693.706    491.658   5.479 4.50e-08 ***
## factor(month)11  2546.745    444.625   5.728 1.08e-08 ***
## factor(month)12 -1887.717    432.223  -4.367 1.28e-05 ***
## radiation          13.751      2.694   5.105 3.44e-07 ***
## precipitation     -36.021      2.502 -14.398  < 2e-16 ***
## 
## Residual standard error: 6198 on 4817 degrees of freedom
##   (92 observations deleted due to missingness)
## Multiple R-squared:  0.593,  Adjusted R-squared:  0.591 
## F-statistic: 292.4 on 24 and 4817 DF,  p-value: < 2.2e-16
# plot actual data vs predicted data
broom::augment(model5) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

vif(model5)
##                    GVIF Df GVIF^(1/(2*Df))
## mean_temp     39.624063  1        6.294765
## wday           1.006419  6        1.000533
## min_temp      22.449048  1        4.738043
## max_temp       7.792187  1        2.791449
## sunshine       4.425835  1        2.103767
## humidity       3.195067  1        1.787475
## factor(month)  9.318188 11        1.106778
## radiation      7.322816  1        2.706070
## precipitation  1.158148  1        1.076173

Further variables/questions to explore on your own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

  • Are other weather variables useful in explaining bikes_hired?
  • We also have data on days of the week, month of the year, etc. Could those be helpful?
  • What’s the best model you can come up with?
  • Is this a regression model to predict or explain? If we use it to predict, what’s the Residual SE?

Ploting your best model predicitons vs reality

Let us say that your best model is model3. How do the predictions of your model compare to the actual data?

# plot actual data vs predicted data

my_best_model <- broom::augment(model3) %>% 
  mutate(day=row_number())



# Plot fitted line and residuals
ggplot(my_best_model, aes(x=day, y = bikes_hired)) +
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

Diagnostics, collinearity, summary tables

As you keep building your models, it makes sense to:

  1. Check the residuals, using autoplot(model_x). You will always have some deviation from normality, especially for very high values of n
  2. As you start building models with more explanatory variables, make sure you use car::vif(model_x) to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have colinear variables. A general guideline is that a VIF larger than 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.
# Check VIF

vif(model2)
## mean_temp   weekend 
##  1.000019  1.000019
vif(model3)
##               GVIF Df GVIF^(1/(2*Df))
## mean_temp 1.000122  1        1.000061
## wday      1.000122  6        1.000010
vif(model4)
##                   GVIF Df GVIF^(1/(2*Df))
## mean_temp     3.734827  1         1.93257
## wday          1.001440  6         1.00012
## humidity      1.667095  1         1.29116
## factor(month) 4.796664 11         1.07387

Comparison of different models

Create a summary table, using huxtable (https://am01-sep23.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error. If you want to add more models, just make sure you do not forget the comma , after the last model, as shown below

# produce summary table comparing models using huxtable::huxreg()
huxreg(model0, model1, model2, model3, model4,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05
) %>% 
  set_caption('Comparison of models')
Comparison of models
(1)(2)(3)(4)(5)
(Intercept)26448.566 ***13662.535 ***14889.505 ***14360.370 ***40543.121 ***
(137.775)   (251.955)   (252.836)   (351.727)   (1037.024)   
mean_temp        1061.121 ***1059.678 ***1058.724 ***675.830 ***
        (18.886)   (18.274)   (18.152)   (32.004)   
weekendTRUE                -4236.092 ***                
                (231.456)                   
wdayMon                        -862.277 *  -934.026 ** 
                        (388.501)   (353.535)   
wdaySat                        -2717.051 ***-2684.169 ***
                        (388.493)   (352.875)   
wdaySun                        -4673.962 ***-4731.880 ***
                        (388.498)   (352.899)   
wdayThu                        1170.713 ** 1177.698 ***
                        (388.492)   (353.771)   
wdayTue                        1204.172 ** 1316.354 ***
                        (388.356)   (353.287)   
wdayWed                        1189.475 ** 1352.949 ***
                        (388.353)   (353.431)   
humidity                                -297.671 ***
                                (10.856)   
factor(month)2                                -186.303    
                                (480.700)   
factor(month)3                                -169.843    
                                (480.469)   
factor(month)4                                -443.267    
                                (517.136)   
factor(month)5                                1666.850 ** 
                                (552.423)   
factor(month)6                                2556.277 ***
                                (608.733)   
factor(month)7                                2308.789 ***
                                (663.946)   
factor(month)8                                -110.333    
                                (626.011)   
factor(month)9                                1849.242 ** 
                                (576.471)   
factor(month)10                                3033.282 ***
                                (513.251)   
factor(month)11                                2839.252 ***
                                (470.891)   
factor(month)12                                -1847.660 ***
                                (458.271)   
#observations4934        4903        4903        4903        4851        
R squared0.000    0.392    0.431    0.439    0.543    
Adj. R Squared0.000    0.392    0.430    0.438    0.541    
Residual SE9677.635    7565.409    7320.118    7270.602    6577.861    
*** p < 0.001; ** p < 0.01; * p < 0.05.